Due to the small nature of our dataset, we will build a model using all available data first, while attempting to be careful to avoid overfitting, and then formally evaluate it using cross validation at the end of our analysis.
Our data comes from four different datasets. We used three of Riguang Wen's datasets from figshare.com -- , , and . We also used a dataset called uploaded to zenodo.org by Pablo Gomez and Sandra Giral. From these datasets, we considered the following variables.
| Variable | Description | Type | Source |
|---|---|---|---|
| Player | Name of player | Character | |
| Age | Age of player | Numeric | |
| G | Games played | Numeric | |
| GS | Games started | Numeric | |
| MP | Minutes played | Numeric | |
| PER | Player efficiency rating | Numeric | |
| PTS | Points | Numeric | |
| X3PAr | 3PA/FGA | Numeric | |
| FTr | FTA/FGA | Numeric | |
| TS | True shooting percentage | Numeric | |
| ORB | Offensive rebounds | Numeric | |
| DRB | Defensive rebounds | Numeric | |
| TRB | Total rebounds | Numeric | |
| AST | Assists | Numeric | |
| STL | Steals | Numeric | |
| BLK | Blocks | Numeric | |
| TOV | Turnovers | Numeric | |
| USG | Usage percentage | Numeric | |
| ORtg | Offensive rating | Numeric | |
| DRtg | Defensive rating | Numeric | |
| OWS | Offensive win shares | Numeric | |
| DWS | Defensive win shares | Numeric | |
| WS | Win shares | Numeric | |
| WS.48 | Win shares per 48 minutes | Numeric | |
| OBPM | Offensive box +/- | Numeric | |
| DBPM | Defensive box +/- | Numeric | |
| BPM | Box +/- | Numeric | |
| VORP | Value over replacement player | Numeric | |
| Pos | Position | Factor | |
| Ht | Height in inches | Numeric | |
| Wt | Weight in pounds | Numeric | |
| PwrSix | Power Six College? | Indicator | |
| International | International Player? | Indicator | |
| Salary | Salary in dollars | Numeric |
Immediately we can recognize that some variables are functions of others and therefore do not need to be considered. Specifically, , so there is no need to include in our model. Similarly, and , so we can exclude and from consideration if we include , , and in our model.
data_init <- read.csv("data/PlayerData.csv")[,-1] %>% mutate(logsal = log(SALARY))
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Conference)
attach(data)
From our EDA, we concluded that a log transformation of the response variable, salary, would be appropriate based on the skewed distribution.
We also noticed that many of the covariates had skewed distributions, or did not have linear marginal relationships with the response log(salary). We attempted to make these variables approximately normally distirbutied, with an approximately linear marginal relationship with the response. We experimented with log, square root, and squaring transformations, and ended up considering the following transformations in our model:
We can also see that Position has some redundancy and maybe too much granularity; for example "F-C" (forward/center) and "C-F" (center/forward) are treated as different positions by the model when functionally they are the same. We cleaned this up by grouping into "Guards", "Wings", and "Bigs". From the graph it is unclear if there is a significant relationship between the refined position predictor and log(salary).
pos_box = data_init %>% ggplot(aes(x = Pos, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (unrefined)")
pos_cat_box = data_init %>% ggplot(aes(x = Pos_cat, y = log(SALARY))) + geom_boxplot() + ggtitle("log(salary) by position (refined)")
grid.arrange(pos_box, pos_cat_box, nrow = 1)
Below we used VORP as an example; we can see the distribution of VORP and its plot against log(salary) pre and post transformation:
It's not perfect, but the transformed variable appears to be much more appropriate for a linear model than the raw variable. Once our predictors are appropriately transformed, we consider all pairwise correlations between covariates as an initial search for possibly collinearity.
corrplot(cor(data[sapply(data, is.numeric)]))
Four pairs of predictors with noticably large correlations according the the correlation matrix are further investigated. We learn that and have a correlation coefficient of \(0.947\), and have a correlation coefficient of \(0.888\), and have a correlation coefficient of \(0.864\), and finally and have a correlation coefficient of \(-0.760\). To avoid redundancy in the model, we'll remove one predictor in each of the four pairs from the model. The terms' variance inflation factors will guide the decision regarding which predictor to drop. Using GVIF\(^\frac{1}{2df}\) will allow the GVIFs to be comparable across dimensions. Thus, we remove (8.77 > 6.25), (4.64 > 4.29), (8.42 > 7.42), and (4.29 > 3.55).
full = lm(logsal ~ ., data = data)
vif(full)[,3]
## Age G gs_adj MP PER
## 1.119102 3.000854 1.860765 6.240509 7.400986
## X3PAr ftr_adj orb_adj DRB. AST.
## 2.495987 1.442517 2.554112 2.256841 3.072991
## stl_adj blk_adj TOV. USG. ORtg
## 1.858169 2.364577 2.885804 5.059147 6.446025
## DRtg ows_adj dws_adj WS.48 OBPM
## 4.268697 3.069610 4.543134 8.393912 6.126404
## DBPM vorp_adj Pos_cat Height pts_adj
## 3.542653 3.301813 1.577793 2.394156 8.743606
## TS. International Conference
## 4.631713 1.113687 1.081437
data <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST.,
stl_adj, blk_adj, TOV., USG., ORtg, ows_adj, dws_adj,
OBPM, DBPM, vorp_adj, Pos_cat, Height, International,
Conference)
We will start by looking at a full model with all remaining predictors. We observe that in this case, Age, MP, vorp_adj, and Multi-team affiliation have very significant relationships with the log of salary, even after accounting for the other predictors. Other variables that have moderately significant relationships with the log of salary even after accounting for the other predictors are G, orb_adj, USG., and ows_adj.
#all_models = regsubsets(SALARY ~ ., force.in = 1, data = data, nbest = max(choose(n_predictors, 0:n_predictors)), really.big = T, nvmax = 13)
null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
summary(full)
##
## Call:
## lm(formula = logsal ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.12733 -0.46812 0.04338 0.55034 1.86549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7021392 2.9814207 2.919 0.003736 **
## Age 0.0658064 0.0106113 6.202 1.54e-09 ***
## G -0.0181663 0.0055382 -3.280 0.001139 **
## gs_adj 0.0194856 0.0475846 0.409 0.682422
## MP 0.0007226 0.0001837 3.933 0.000101 ***
## PER -0.0871335 0.0465252 -1.873 0.061907 .
## X3PAr 0.2832213 0.5158578 0.549 0.583327
## ftr_adj -0.4945672 0.3990974 -1.239 0.216078
## orb_adj 0.3410770 0.1307090 2.609 0.009450 **
## DRB. -0.0136510 0.0159678 -0.855 0.393174
## AST. 0.0189639 0.0094084 2.016 0.044586 *
## stl_adj 0.1826712 0.2372052 0.770 0.441750
## blk_adj -0.1199107 0.1824635 -0.657 0.511490
## TOV. 0.0036909 0.0141669 0.261 0.794603
## USG. 0.0728354 0.0251642 2.894 0.004032 **
## ORtg 0.0150725 0.0144326 1.044 0.297032
## ows_adj 1.2644686 0.3969230 3.186 0.001571 **
## dws_adj 0.6322679 0.2681370 2.358 0.018911 *
## OBPM 0.1038468 0.0754130 1.377 0.169360
## DBPM 0.1310865 0.0565073 2.320 0.020913 *
## vorp_adj -0.9380214 0.1972369 -4.756 2.87e-06 ***
## Pos_catG -0.4680186 0.2487174 -1.882 0.060684 .
## Pos_catWing -0.2256998 0.1571007 -1.437 0.151689
## Height 0.0194139 0.0290128 0.669 0.503831
## InternationalYes -0.0264050 0.1100976 -0.240 0.810597
## ConferenceMulti -0.6825231 0.1578069 -4.325 1.98e-05 ***
## ConferenceWest -0.0928064 0.0952214 -0.975 0.330398
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8297 on 358 degrees of freedom
## Multiple R-squared: 0.557, Adjusted R-squared: 0.5249
## F-statistic: 17.32 on 26 and 358 DF, p-value: < 2.2e-16
We'll next run a stepwise model search using both AIC and BIC to provide another starting point:
bic = log(nrow(data))
stepbic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = bic)
stepaic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = 2)
print(stepaic_model$call)
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM + orb_adj +
## Pos_cat + AST. + vorp_adj + OBPM + G + PER + ows_adj + dws_adj,
## data = data)
print(stepbic_model$call)
## lm(formula = logsal ~ MP + Age + Conference + USG. + DBPM, data = data)
From this initial analysis, some important predictors appear to be MP, Age, Conference, USG., and DBPM. We can see that the 'conference' factor is really picking up on players that appeared on multiple teams. Sometimes these are good players who have been traded, but often these players are end of the bench guys that sign cheap, short term contracts. This having a relationship with salary would make sense. We created in our preprocessing script a new variable that just flags players that appeared on multiple teams within the season.
We'll look at some diagnostic plots for these models to see where we're at:
stepaic_resid_plot <- data.frame(resid=stepaic_model$residuals, fitted_logsal=stepaic_model$fitted.values) %>% ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
stepbic_resid_plot <- data.frame(resid=stepbic_model$residuals , fitted_logsal=stepbic_model$fitted.values) %>% ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise BIC model residual plot")
grid.arrange(stepaic_resid_plot, stepbic_resid_plot, nrow = 1)
We can see that there are some problems with these residual plots. The densest areas of each plot appear to be following a downward trend, and the variance is not constant across all fitted values, i.e., these are not null plots. The models need improvement. It is also noteworthy that despite AIC including several more predictors, the residual plots are very similar.
null = lm(logsal ~ 1, data = data)
full = lm(logsal ~ ., data = data)
summary(full)
bic = log(nrow(data))
stepbic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = bic)
stepaic_model <- stepAIC(object = null, scope = list(lower = null, upper = full),
direction = "both", k = 2)
We will build some basic models on the data after these adjustments, to get an idea of what predictors are still important:
data_trimmed <- data_init %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
corrplot(cor(data_trimmed[sapply(data_trimmed, is.numeric)]), method="number")
trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
summary(trimmed_full)
##
## Call:
## lm(formula = logsal ~ ., data = data_trimmed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.97747 -0.44726 0.02749 0.55182 2.12532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.7887241 5.7929454 1.862 0.063374 .
## Age 0.0619535 0.0105083 5.896 8.69e-09 ***
## G -0.0211991 0.0055688 -3.807 0.000166 ***
## gs_adj -0.0041681 0.0482924 -0.086 0.931269
## MP -0.0001719 0.0003178 -0.541 0.588925
## PER 0.0339671 0.0610374 0.556 0.578222
## X3PAr 0.3452160 0.5324448 0.648 0.517171
## ftr_adj -0.3683558 0.4140906 -0.890 0.374308
## orb_adj 0.1968777 0.1632128 1.206 0.228519
## DRB. -0.0263340 0.0160502 -1.641 0.101739
## AST. 0.0049831 0.0138668 0.359 0.719542
## stl_adj -0.0978102 0.2632827 -0.372 0.710485
## blk_adj -0.3489851 0.1926931 -1.811 0.070972 .
## TOV. 0.0267741 0.0249610 1.073 0.284162
## USG. -0.0129629 0.0417970 -0.310 0.756637
## ORtg 0.0500782 0.0235417 2.127 0.034092 *
## DRtg -0.0297096 0.0450437 -0.660 0.509956
## ows_adj 0.7769953 0.4639111 1.675 0.094838 .
## dws_adj 0.7746716 0.3594889 2.155 0.031840 *
## WS.48 -9.0274781 4.7809750 -1.888 0.059813 .
## OBPM 0.0728867 0.0936530 0.778 0.436932
## DBPM 0.1185353 0.0726130 1.632 0.103477
## vorp_adj -0.6613434 0.2286878 -2.892 0.004065 **
## Pos_catG -0.4969772 0.2457386 -2.022 0.043886 *
## Pos_catWing -0.2122213 0.1554118 -1.366 0.172948
## Height 0.0224363 0.0288497 0.778 0.437266
## pts_adj 0.0998109 0.0362844 2.751 0.006250 **
## TS. -4.2719066 3.1091010 -1.374 0.170308
## InternationalYes -0.0032859 0.1099257 -0.030 0.976170
## Multiteam -0.5938917 0.1483221 -4.004 7.58e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8198 on 355 degrees of freedom
## Multiple R-squared: 0.5712, Adjusted R-squared: 0.5361
## F-statistic: 16.3 on 29 and 355 DF, p-value: < 2.2e-16
print(trimmed_aic$call)
## lm(formula = logsal ~ pts_adj + Age + Multiteam + DBPM + G +
## vorp_adj + orb_adj + blk_adj + Pos_cat + AST. + DRB. + dws_adj +
## X3PAr + ows_adj, data = data_trimmed)
From this point, we can see that adjusted VORP, adjusted points, Age, G, and Multiteam are the most important predictors, with many others appearing to be useful. We'll look at residuals of a basic model and see where we're at:
trimmed_resid_plot <- data.frame(resid=trimmed_aic$residuals , fitted_logsal=trimmed_aic$fitted.values) %>% ggplot(aes(x=fitted_logsal, y=resid)) + geom_point() + ggtitle("Stepwise AIC model residual plot")
trimmed_resid_plot
The residual cloud is similar to those before we cut unnecessary predictors.
Now, we'll look at some of the highest leverage cases.
show_diagnostics <- function(model){
x = model.matrix(model)
h = x %*% solve(t(x) %*% x) %*% t(x)
leverages = diag(h)
lev_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, leverage=leverages) %>% ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=leverages, color=leverages)) + ggtitle("Residual plot with leverages")
cooks = cooks.distance(model)
cook_resid_plot <- data.frame(resid=model$residuals , fitted_logsal=model$fitted.values, cook_distance=cooks) %>% ggplot(aes(x=fitted_logsal, y=resid)) + geom_point(aes(size=cook_distance, color=cook_distance)) + ggtitle("Residual plot with cooks distance")
print(lev_resid_plot)
print(cook_resid_plot)
}
show_diagnostics(trimmed_aic)
There's an outlying case with a large leverage and cooks distance; we'll use a t-test for a mean shift to formally test if this is an outlier:
test_outlier_meanshift <- function(model, ind){
p = length(coef(model))
n = nrow(model.matrix(model))
ri = rstandard(model)[ind]
t_stat = ri * sqrt(((n - p - 1) / (n - p - (ri ^ 2))))
pval = pt(t_stat, n - p - 1)
return (pval)
}
test_outlier_meanshift(trimmed_aic, which.max(cooks.distance(trimmed_aic)))
## 348
## 0.02061654
We notice that the p-value for a mean shift for this case is small. Looking at the data for this particular observation, we see that it is an outlier on many levels - this player had 2 games played, 6 minutes, a wildly high PER and Usage rate, 0 for many counting stats (Assists, steals, etc.), and a very small salary of $30000. Many of these stats are unstable due to the very small number of minutes played. Also, this player is Thanasis Antetokounmpo, who is a unique case because many people consider him to be a non-NBA level player, who only got into the league because his brother Giannis is one of the best players in the world. Thus, we feel he is not a useful data point, and are comfortable removing him from the dataset. We refit a basic model without him.
data <- data_init %>% filter(Player != "Thanasis Antetokounmpo")
data_trimmed <- data %>% mutate(logsal = log(SALARY)) %>% dplyr::select(logsal, Age, G, gs_adj, MP, PER, X3PAr, ftr_adj, orb_adj, DRB., AST., stl_adj, blk_adj, TOV., USG., ORtg, DRtg, ows_adj, dws_adj, WS.48, OBPM, DBPM, vorp_adj, Pos_cat, Height, pts_adj, TS., International, Multiteam)
trimmed_null = lm(logsal ~ 1, data = data_trimmed)
trimmed_full = lm(logsal ~ ., data = data_trimmed)
vif(trimmed_full)
summary(trimmed_full)
trimmed_aic <- stepAIC(object = trimmed_null, scope = list(lower = trimmed_null, upper = trimmed_full),
direction = "both", k = 2)
show_diagnostics(trimmed_aic)
We can also use VIFs to evaluate redundancy in data:
full = lm(logsal ~ ., data = data_trimmed)
vif(full)[,3]
## Age G gs_adj MP PER
## 1.112267 2.987726 1.853649 6.481753 7.361902
## X3PAr ftr_adj orb_adj DRB. AST.
## 2.566367 1.437117 2.780883 2.236671 3.315685
## stl_adj blk_adj TOV. USG. ORtg
## 1.817906 2.356146 2.908801 5.029606 6.356921
## DRtg ows_adj dws_adj WS.48 OBPM
## 4.267611 3.089980 4.606512 8.263129 6.250622
## DBPM vorp_adj Pos_cat Height pts_adj
## 3.421618 3.300426 1.580763 2.403358 9.664145
## TS. International Multiteam
## 4.667096 1.116171 1.086591
#full2 = lm(logsal ~ . - WS.48 - PER - ORtg - PTS, data = data) # Sequentially removed highest VIF until next iteration lowered both Adj R2 and Mult R2